(54) Apparatus and methods for sharing idle workstations



(57) The present invention relates to systems for
sharing idle workstation computers that are connected
together through a network and shared file system.
More particularly, a user of a local host workstation may
submit jobs for execution on remote workstations. The
systems of the present invention select a remote host
that is idle in accordance with a decentralized scheduling
scheme and then continuously monitor the activity
of the remote host on which the job is executing. If the
system detects certain activity on the remote host by
one of the remote host's primary users, the execution of
the job is immediately suspended to prevent inconvenience
to the primary users. The system also suspends
job execution if the remote host's load average gets too
high. Either way, the suspended job is migrated by
selecting another idle remote workstation to resume
execution of the suspended job (from the point in time at
which the last checkpoint occurred).
Description

Cross Reference to Related Applications 

The present invention is related to the following
International Patent Applications: "Persistent State
Checkpoint And Restoration Systems," PCT Patent
Application No. PCT/US95/07629 and "Checkpoint And
Restoration Systems For Execution Control," PCT Patent
Application No. PCT/US95/07660, both of which are
assigned to the assignee of the present invention and
both of which are incorporated herein by reference.
Additionally, the following articles are also incorporated
herein by reference: Y.M. Wang et al., "Checkpointing
and Its Applications," Symposium on Fault-Tolerant
Computing, 1995, and G.S. Fowler, "The Shell as a
Service," USENIX Conference Proceedings, June
1993.

Background of The Invention 

The present invention relates to networked computer
systems. In particular, the present invention relates
to computing environments that are populated with
networked workstations sharing network file systems. One
known system for sharing idle resources on networked
computer systems is the Condor system that was
developed at the University of Wisconsin (idle resources refer,
generally, to a networked workstation having no user
input commands for at least a certain period of time). The
Condor system is more fully described in M. Litzkow et
al., "Condor - a hunter of idle workstations," Proc.
ICDCS, pp. 104-111, 1988.

Some of the main disadvantages of the Condor system
are as follows. Condor uses a centralized scheduler
(on a centralized server) to allocate network resources,
which is a potential security risk. For example, having a
centralized server requires that the server have root
privileges in order for it to create processes that
impersonate the individual users who submit jobs to it for
remote execution. Any lack of correctness in the coding
of the centralized server (e.g., any "bugs") may allow
someone other than the authorized user to gain access
to other users' privileges. Secondly, the Condor system
migrates the execution of a job on a remote host as soon
as it detects any mouse or keyboard activity (i.e., user
inputs) on the remote host. Therefore, if any user,
including users other than the primary user, begins using
the remote host, shared execution is terminated and the
task is migrated. This causes needless migration and
its concomitant loss of work. Thirdly, due to the nature
of a centralized server, starting the server can only be
accomplished by someone with root privileges.

Another known system for accomplishing resource
sharing is the Coshell system (which is described more
fully in the G.S. Fowler article "The Shell as a Service,"
incorporated by reference above). Coshell, unfortunately,
also has disadvantages, such as the fact that Coshell
also suspends a job on the remote host whenever the
remote host has any mouse or keyboard activity.
Additionally, Coshell cannot migrate a suspended job (a job
that was executing remotely on a workstation and was
suspended upon a user input at the remote workstation)
to another machine, but has to wait until the mouse or
keyboard activity ends on the remote host before
resuming execution of the job. The fact that Coshell suspends
the job's execution in response to any mouse or keyboard
activity creates needless suspensions of the job
when any mouse or keyboard activity is sensed at the
remote host.

It would therefore be desirable to provide systems
and methods of more efficiently sharing computational
resources between networked workstations.

It would also be desirable to provide program execution
on idle, remote workstations in which suspension
of the program execution is reduced to further increase
processing efficiency.

It would be still further desirable to provide a network
resource sharing architecture in which suspended
jobs may be efficiently migrated to alternate idle
workstations when the initial idle, remote workstation is once
again in use by the primary user.

Summary of the Invention 

The above and other objects of the invention are
accomplished by providing methods for increasing the
efficiency in sharing computational resources in a network
architecture having multiple workstations in which
at least one of those workstations is idle. The present
invention provides routines that, once a job is executing
remotely, do not suspend that job unless the primary
user "retakes possession" of the workstation (versus any
user other than the primary user). Additionally, the
present invention provides the capability to migrate a
remote job, once it has been suspended, to another idle
workstation in an efficient manner, rather than requiring
that the job remain on the remote workstation in a
suspended state until the remote workstation becomes idle
again.

Brief Description of the Drawings 

The above and other objects of the present invention
will be apparent upon consideration of the following
detailed description, taken in conjunction with the
accompanying drawings, in which like reference characters
refer to like parts throughout, and in which:

FIG. 1 is an illustrative schematic diagram that depicts
substantially all of the processes that are typically
active when a single user, on the local host,
has a single job running on a remote host in accordance
with the principles of the present invention;
and

FIG. 2 is a table that depicts a representative sample
of an attributes file in accordance with the principles
of the present invention.

Detailed Description of The Invention 

FIG. 1 is an illustrative schematic diagram that depicts
substantially all of the processes that are typically
active when a single user (who shall be referred to as
user X), on the local host, has a single job running on a
remote host. FIG. 1 is divided into halves 100 and 101,
where half 100 depicts processes 102-105 running on
the local host 100, and half 101 depicts processes
106-110 running on a remote host 101. Thus, user X on
local host 100 (i.e., the left half of FIG. 1) has a single
job 109 running on a remote host 101 (i.e., the right half
of FIG. 1).

Hosts 100 and 101 are connected together through
a network and share a file system. While FIG. 1 illustrates
a network having only two hosts, it should be understood
that the techniques of the present invention will
typically be used in computing environments where
there are multiple hosts, all being networked together,
either directly together or in groups of networks (e.g.,
multiple local area networks, or LANs, or through a wide
area network, or WAN), and sharing a file system.

The process by which a user, such as user X, submits
a job for execution on a remote host is as follows,
as illustrated by reference to FIG. 1. The following
example assumes a point in time at which none of the
processes described in FIG. 1 have been started. In this
illustrative example, the principles of the present invention
are applied to the Coshell system that was described
and incorporated by reference above.

Initially, a user starts a coshell process (a coshell
process is a process that automatically executes shell
actions on lightly loaded hosts in a local network) as a
daemon process (a process that, once started by a user
on a workstation, does not terminate unless the user
terminates it) on the user's local host. Thus, in FIG. 1, user
X starts coshell process 104 as a daemon process on
local host 100. Every coshell daemon process serves
only the user who has started it. For example, while
another user Y may log onto host 100, only user X can
communicate with coshell process 104.

Upon being started, a coshell daemon process
starts a status daemon (unless a status daemon is already
running), on its own local host, and on every other
remote host to which the coshell may submit a job over
the network. In the case of FIG. 1, coshell 104 has started
status daemon 105 running on host 100 and status
daemon 110 running on host 101.

Each status daemon collects certain current information
regarding the host it is running on (i.e., status
daemon 105 collects information about host 100, while
status daemon 110 collects information about host 101)
and posts this information to a unique status file which
is visible to all the other networked hosts. This current
information comprises: (i) the one minute load average
of the host as determined by the Unix command uptime,
and (ii) the time elapsed since the last activity on an input
device, either directly or through remote login, by a
primary user of the host. All of the status files of all the
status daemons are kept in a central directory. A central
directory is used to reduce the network traffic that would
otherwise occur if, for example, every status daemon
had to post its current information to every other host on
the network. In particular, a central directory of status
files causes the network traffic to scale linearly with the
number of hosts added. Furthermore, each status daemon
posts its current information every 40 seconds plus
a random fraction of 10 seconds; the random fraction
is intended to further relieve network congestion by
temporally distributing the posting of current information. A
coshell uses these status files to select an appropriate
remote host upon which to run a job submitted to it by
the coshell's user.
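
By way of illustration only, the posting schedule described
above can be sketched in Python as follows. The function
names, the key=value record layout, and the uniform
distribution of the random fraction are assumptions made for
this sketch; they are not taken from the Coshell system.

```python
import random


def next_post_delay(base=40.0, jitter=10.0, rng=random):
    """Seconds until the next status post: 40 s plus a random fraction of 10 s."""
    # rng.random() is in [0.0, 1.0), so the delay is in [40.0, 50.0).
    return base + rng.random() * jitter


def format_status(load_avg_1min, idle_seconds):
    """Render one status record (hypothetical layout, not the real file format)."""
    return f"load={load_avg_1min:.2f} idle={int(idle_seconds)}"
```

Because every daemon draws its own random fraction, posts from
hosts that were started at the same moment drift apart instead
of arriving at the central directory in bursts.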

Each coshell selects a remote host independently
of all of the other coshells that may be running and is
therefore said to implement a form of decentralized
scheduling. The coshells avoid the possibility of all
submitting their jobs to the same remote host by having a
random smoothing component in their remote host
selection algorithm. Rather than choosing a single best
remote host, each coshell chooses a set of candidate
remote hosts and then picks an individual host from within
that set randomly.
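
A minimal sketch of the random smoothing step, in Python, is
given below. It assumes that the candidate set has already
been computed and that the pick is uniform; the description
does not state the distribution, and the function name is
hypothetical.

```python
import random


def pick_remote_host(candidates, rng=random):
    """Pick one host at random from the candidate set, or None if it is empty."""
    if not candidates:
        return None  # no idle host: the caller would queue the job
    # Sorting gives a stable order so a seeded rng is reproducible.
    return rng.choice(sorted(candidates))
```

Two coshells that see the same candidate set will usually make
different picks, which is what keeps them from all submitting
to a single "best" host at once.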

A coshell chooses a set of candidate remote hosts
based upon the current information in the status files
and the static information in an attributes file. There is
a single attributes file shared by all the coshells. For
each host capable of being shared by the present invention,
the attributes file typically stores the following
attributes: type, rating, idle, mem and puser. An illustrative
example of an attributes file is shown in the table of FIG.
2. Each line of FIG. 2 contains, from left to right, the host
name followed by assignments of values to each of the
attributes for that host.

The type attribute differentiates between host
types. For example, the "sgi.mips" in line 1 of FIG. 2
indicates a host of the Silicon Graphics workstation type,
while "sun4" indicates a Sun Microsystems workstation
type. The rating attribute specifies the speed for a host
in MIPS (millions of instructions per second). All hosts
have preferably had their MIPS rating evaluated according
to a common benchmark program. This ensures the
validity of MIPS comparisons between hosts. The idle
attribute specifies the minimum time which must elapse
since the last primary user activity on a host (as posted
in the host's status file) before the present invention can
use the host as a remote host. The mem attribute describes
the size (in megabytes) of the main memory of
the host, while the puser attribute describes the primary
users of the host. For example, host "banana" of FIG. 2
has two primary users indicated by the quoted list
"emerald ymwang," while host "orange" only has the primary
user "ruby" (no quotes are necessary when there is only
one primary user).
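
FIG. 2 is not reproduced in this text, so the exact syntax of an
attributes line is not known here. The following Python sketch
assumes a line of the form "host name=value name=value ...",
with a quoted list for multiple primary users; that layout, the
function name, and the numeric values in the usage example are
all hypothetical.

```python
import shlex


def parse_attributes_line(line):
    """Split an assumed 'host key=value ...' line into (host, attributes)."""
    tokens = shlex.split(line)  # honors the quoted puser list
    host, attrs = tokens[0], {}
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        attrs[key] = value
    # A host may have one primary user or a space-separated list of them.
    attrs["puser"] = attrs.get("puser", "").split()
    return host, attrs
```

For example, parsing the made-up line
banana type=sun4 rating=20 idle=15m mem=32 puser="emerald ymwang"
yields host "banana" with the two primary users emerald and
ymwang.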

In general, the primary users attribute of a host is
simply a subset of the universe of potential users of the
host, selected as primary because they should be accorded
extra priority in having access to the host's resources.

Typically, if the host is physically located in a partic- 
ular person's office, then the primary users of that host 
would include the occupant of that office as well as any 
administrative assistants to the occupant. For certain 
computer installations, it is possible to ascertain the oc- 
cupant of the office containing the host from the name 
of the home directory on the host's physically attached 
disk. The names of these home directories can be col- 
lected automatically, over the network, thereby making 
it easier to create and maintain an attributes file. 

In addition to, or as an alternative to, the occupant 
of an office containing the host, the primary users of a 
host may include anyone who logs into the host from the 
host's console (where the console comprises the key- 
board and video screen which are physically attached 
to the host). 

Included among any other qualifications required of
a remote host for it to be included in the set of candidate
remote hosts chosen by a coshell, is the requirement
that the remote host be idle. A remote host is considered
idle if all of the following three conditions are satisfied:
(i) the one minute load average (as posted in the host's
status file) is less than a threshold (with the threshold
typically being approximately 0.5); (ii) the time elapsed
since the last primary user activity (as posted in the
host's status file) is at least the period of time specified
for the host by the idle attribute in the attributes file
(where the period of time is typically approximately 15
minutes); and (iii) no nontrivial jobs are being run by a
primary user of the remote host.
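
The three-part idleness test can be sketched in Python as
follows. The sketch reads the idle attribute as its definition
states, namely as the minimum time that must have elapsed since
the last primary user activity; the function and parameter
names are assumptions, and the default threshold is the typical
value of 0.5 given above.

```python
def is_idle(load_avg_1min, seconds_since_primary_activity,
            idle_attribute_seconds, primary_user_has_nontrivial_job,
            load_threshold=0.5):
    """Return True only if all three idleness conditions hold."""
    return (
        load_avg_1min < load_threshold                                 # (i)
        and seconds_since_primary_activity >= idle_attribute_seconds   # (ii)
        and not primary_user_has_nontrivial_job                        # (iii)
    )
```

With a 15 minute idle attribute (900 seconds), a host with a
load average of 0.2 whose primary user last touched the
keyboard 20 minutes ago qualifies, while the same host 5
minutes after the last keystroke does not.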

A nontrivial job is typically determined as follows.
The Unix command "ps" (which means process status)
is executed periodically in pairs, with the two
executions of each pair separated from each other by about
one minute. An execution of the ps command returns,
for each process on the host, a flag indicating whether
or not the process was running as of the time the ps
command was executed. If for both executions of ps the
flags for a process indicate that the process was running,
then the process is nontrivial. If a process is indicated
as having stopped, for at least one of the two ps
executions, then the process is considered trivial.
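
The two-sample rule can be expressed compactly. In the Python
sketch below, each ps execution is assumed to have already been
reduced to a mapping from process ID to a running flag; how the
flag is extracted from ps output is platform specific and is
not shown, and the function name is hypothetical.

```python
def nontrivial_pids(first_sample, second_sample):
    """Return the process IDs flagged as running in both ps samples.

    Each sample maps pid -> True if ps reported the process as running.
    A process missing from the second sample is treated as not running.
    """
    return {
        pid
        for pid, running in first_sample.items()
        if running and second_sample.get(pid, False)
    }
```

A process that was running in only one of the two samples, such
as an editor waiting for input, is therefore classified as
trivial.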

Once a user has a coshell daemon running, the user
can start a particular job running on a remote host by
starting a submit program. The submit program takes
as arguments the job's name as well as any arguments
which the job itself requires. The job name can identify
a system command, application program object file or a
shell script. The job name can also be an alias to any of
the items listed in the previous sentence. In FIG. 1, user
X has started submit process 102 running with the name
for job 109 as an argument. A submit process keeps
running until the job passed to it has finished executing on
a remote host. Therefore, submit process 102 keeps
running until job 109 is finished.

If, for example, user X wants to run another job
remotely, user X will have to start another submit process.
Each such submit process keeps running until the job
passed to it as an argument has finished executing on
its remote host. Thus, there may be several submit
processes running at the same time for the same user on a
given local host (e.g., on local host 100).

The first action a submit process takes upon being
started is to spawn a coshell client process. For example,
submit process 102 has started client process 103
through spawning action 113. As used throughout the
present application, spawning refers to the general
process by which a parent process (e.g., submit process
102) creates a duplicate child process (e.g., client
process 103). In Unix, the spawning action is accomplished
by the fork system call. A client process provides an
output on the local host for the standard error and standard
output of the job executing on a remote host. A client
process also has two way communication with its
coshell via command and status pipes. Client process
103 receives the standard error and the standard output
of job 109. In addition, client process 103 communicates
with its coshell 104 via command pipe 114 and status
pipe 115.

Once the client process has been started, its coshell
then proceeds to select a remote host upon which to run
the submitted job. As discussed above, each coshell
selects a remote host independently of all the other
coshells and utilizes random smoothing to prevent the
overloading of any one remote host. If there are no idle
remote hosts for a coshell to execute a submitted job
upon, the coshell will queue the job until an idle remote
host becomes available. Coshell 104, in the example
shown in FIG. 1, has selected idle remote host 101 to
run the submitted job.

Having selected a remote host, the coshell next
starts a shell on the remote host, unless the coshell
already has a shell running on that remote host left over
from a previous job submission by that coshell to that
same remote host. Assuming that the coshell does not
already have a shell running on the selected remote
host, it will start one with the Unix command rsh. A
coshell has two way communication with the shell it has
created via a command pipe and a status pipe. In FIG.
1, coshell 104 has started a shell ksh 106 on remote
host 101. Coshell 104 and ksh 106 communicate via
command pipe 116 and status pipe 120.

It should be noted that the type of shell started on
a remote host by a coshell, referred to as a ksh, differs
from an ordinary Unix shell only in its ability to send its
standard output and standard error (produced by a
remotely submitted job) over the network and back to its
coshell. The coshell then routes this standard output
and standard error to the correct client process. In the
case of FIG. 1, the standard output and standard error
of job 109, which is running in ksh 106, are sent over the
network to coshell 104. Coshell 104 then routes this
standard output and standard error to client process
103.

Once a ksh is running on the selected remote host,
a program called LDSTUB is started under the ksh. An
LDSTUB process is started under the ksh for each job,
from the ksh's coshell, which is to be executed on the
selected remote host. In the case of FIG. 1, LDSTUB
process 107 is started under ksh 106. Upon being started,
the LDSTUB process begins the execution of the
user's job and a monitor daemon by spawning both of
them off. Thus, LDSTUB process 107 has started monitor
daemon 108 through spawn 119 and has started
user X's job 109 through spawn 118.

A monitor daemon is a fine-grain polling process. It
checks for the following conditions on the selected
remote host: (i) any activity on an input device, either
directly or through remote login, by a primary user of the
selected remote host; (ii) any nontrivial jobs being run
by a primary user of the selected remote host; or (iii) a
one minute load average on the selected remote host
(as determined by the monitor daemon itself running the
Unix command uptime) which is equal to or greater than
a particular threshold (typically the threshold is about
3.5). A monitor daemon is a fine-grain polling process
because it frequently checks for the above three conditions,
typically checking every 10 seconds. If one or
more of the above three conditions is satisfied, the monitor
daemon sends a signal to its LDSTUB process. This
signal to its LDSTUB process starts a process, described
below, which migrates the user's job to another
remote host. In the example shown in FIG. 1, monitor
daemon 108 checks for each of the above three conditions
on remote host 101.
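
The polling behavior of a monitor daemon can be sketched in
Python as follows. The sample callback, which is assumed to
return the three observations as a tuple, and both function
names are hypothetical; the defaults are the typical values of
3.5 and 10 seconds given above.

```python
import time


def should_migrate(primary_user_active, primary_user_has_nontrivial_job,
                   load_avg_1min, load_threshold=3.5):
    """True if any one of the three monitored conditions is satisfied."""
    return (primary_user_active
            or primary_user_has_nontrivial_job
            or load_avg_1min >= load_threshold)


def monitor(sample, interval=10.0, sleep=time.sleep):
    """Poll until a migration condition holds, then return to the caller."""
    while True:
        active, busy, load = sample()
        if should_migrate(active, busy, load):
            return  # a real daemon would now exit, notifying its parent
        sleep(interval)
```

Passing sleep as a parameter is only a convenience of the
sketch: it lets the loop be exercised without waiting 10
seconds between samples.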

The user's job is usually run on the remote host having
been already linked with the Libckp checkpointing
library. Libckp is a user-transparent checkpointing library
for Unix applications. Libckp can be linked with a
user's program to periodically save the program state
on stable storage without requiring any modification to
the user's source code. The checkpointed program state
includes the following in order to provide a truly transparent
and consistent migration: (i) the program counter;
(ii) the stack pointer; (iii) the program stack; (iv) open
file descriptors; (v) global or static variables; (vi) the
dynamically allocated memory of the program and of the
libraries linked with the program; and (vii) all persistent
state (i.e., user files). User X's job 109, in this example,
has been linked with Libckp library 111.

Each of the processes shown in FIG. 1, namely
processes 102-110, is active at this point in the job
submission process.

Up to this point in the description of the job submission
process, the focus has been on presenting the
process by which a single job is submitted for remote
execution by a single user. It should be noted, however,
that a user of a local host may start a second job running
on a remote host before the first submitted job has finished
executing. This second job submission is coordinated
with the processes already running for the first job
submission as follows.

First, the user starts the submit program a second
time with the second job name as an argument (in the
same manner as described above for the first job). This
second starting of the submit program creates a second
submit process just for managing the remote execution
of the second job. The second submit process spawns
a second coshell client process (similar to client process
103) which is also just for the remote execution of
the second job. The second coshell client process, however,
communicates with the same coshell with which the
first coshell client process communicates.

The coshell then selects a remote host for
the second job to execute upon and starts a ksh on that
remote host, unless the coshell already has a ksh running
on that remote host. For example, if the coshell
chooses the same remote host upon which the first job
is executing, then the coshell will use the same ksh in
which the first job is executing. Regardless of whether
the coshell needs to create a new ksh or not, within the
selected ksh a second LDSTUB process, just for the
remote execution of the second job, is started. The second
LDSTUB process spawns the second job and a second
monitor daemon.

In general then, it can be seen that if a user has N
jobs remotely executing which have all been submitted
from the same local host, then the user will have a
unique set of submit, client, LDSTUB, monitor daemon
and job processes created for each of the N jobs submitted.
All of the N jobs will share the same coshell daemon.
Any of the N jobs executing on the same remote
host will share the same ksh.

The remainder of this description addresses the operation
of the present invention with respect to a single
remotely executing job for one user of a local host. For
any additional remotely executing jobs submitted by the
same user from the same local host, each such additional
job will have its own unique set of independent
processes which will respond to conditions in the same
manner as described below.

At the point in the job submission process when a
user's job is remotely executing, one of two main conditions
will occur: (i) one or more of the three conditions
described above as being checked for by a monitor daemon
is satisfied, causing the user's job to be migrated to
another remote host; or (ii) the remotely executing job
will finish. The process initiated, at this point in the job
submission process, by each of these two main conditions
is described, in turn, below.

If the first main condition (which may be one or more
of the three conditions checked for by a monitor daemon
occurring) is satisfied, the following steps will occur in
order to migrate the user's job to another remote host.
Each of these steps will be explicated by reference to
the specific process configuration of FIG. 1.

First, the satisfaction of the first main condition
causes the monitor daemon to exit. The exiting of the
monitor daemon causes a "SIGCHLD" signal to be sent
to its parent LDSTUB process. In FIG. 1, the exiting of
monitor daemon 108, upon the satisfaction of the first
main condition, causes a "SIGCHLD" signal to be sent
to monitor daemon 108's parent process LDSTUB 107.

Second, the LDSTUB process kills the user's job.
In FIG. 1, LDSTUB process 107 kills user X's job 109.
If the user's job has been linked with Libckp, as is the
case for job 109, a subsequent restart of the job will only
mean that work since the last checkpoint is lost. Typically,
Libckp performs a checkpoint every 30 minutes so
that only the last 30 minutes of job execution, at most,
will be lost.

The third step is as follows. There is a socket connection
between an LDSTUB process and its submit
process. Over this socket connection the LDSTUB process
sends a message to its submit process telling it to
have the user's job run on another remote host. LDSTUB
process 107 sends such a message over socket
112 to submit process 102.

In the fourth step, the submit process kills the client
process it had originally spawned off to communicate
with the remote job through coshell. The submit process
also kills its LDSTUB process. In the case of FIG. 1,
submit process 102 kills client process 103 and LDSTUB
process 107.

In the fifth step, the submit process resubmits the
user's job to another remote host according to the submission
process described above. Submit process 102
resubmits job 109 to coshell 104 for execution on another
remote host. The subsequent processes and the
other remote host which would be a part of job 109's
resubmission are not shown in FIG. 1. When the user's
job is executed on another remote host, the first action
of the job is to check for a checkpoint file. If the job finds
a checkpoint file it restores the state of the job as of the
point in time of the last checkpoint.
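
Libckp performs checkpointing transparently at the process
level. The Python sketch below instead uses explicit,
application-level checkpointing to a JSON file, purely to
illustrate the restart rule that a job first looks for a
checkpoint file and resumes from it if one exists. The function
names and the file format are assumptions and are not part of
Libckp.

```python
import json
import os


def load_state(path, initial):
    """Return the checkpointed state if a checkpoint file exists, else initial."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return initial


def save_state(path, state):
    """Write a checkpoint atomically so a kill never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on the same file system
```

Because the hosts share a file system, a checkpoint written on
one remote host is visible to the job when it is restarted on
another.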

It is important to note that the ksh under which the
user's job had been running, along with its pipe connections
to its coshell, is kept running. This ksh may be used
again later by its coshell provided that the remote host
upon which the ksh is running becomes idle again by
satisfying the three conditions described above for a
candidate remote host. In the case of FIG. 1, ksh 106 is
kept running along with its command pipe 116 and status
pipe 120 connections to coshell 104.

Alternatively, if the second main condition described
above (of the remotely executing job finishing)
is satisfied, the following steps will occur. As with the
first main condition, each of these steps will be explicated
by reference to the specific process configuration of
FIG. 1.

First, when the user's job finishes execution on a
remote host, it exits causing a "SIGCHLD" signal to be
sent to its parent process (its LDSTUB process). In
response to receiving the "SIGCHLD" signal, LDSTUB
obtains the exit status number (a status number is returned
by every exiting Unix process to indicate its completion
status) of the user's job with the Unix system call
wait(2). When user X's job 109 finishes execution on remote
host 101 it exits causing a "SIGCHLD" signal to be sent
to its parent LDSTUB process 107. LDSTUB process
107 then obtains job 109's exit status number with the
wait(2) system call.

Second, the LDSTUB process kills its monitor dae- 
mon. LDSTUB process 107 kills monitor daemon 108. 

The third step is as follows. There is a socket connection
between the LDSTUB process and its submit
process. Over this socket connection the LDSTUB process
sends a message to its submit process telling it that
the user's job has finished and containing the status
number returned by the user's job. LDSTUB process
107 sends such a message over socket 112 to submit
process 102.

In the fourth step, the submit process kills the client
process it had originally spawned off to communicate
with the remote job through coshell. The coshell then
detects that a particular client process has been killed
by a particular signal number. In response to detecting
this the coshell sends a message to the appropriate ksh.
The message sent by the coshell instructs the ksh to kill
the LDSTUB process, corresponding to the submit process,
using the same signal number by which the submit
process killed its client process. In the case of FIG. 1,
submit process 102 kills client process 103. Coshell 104
then detects that client process 103 has been killed by
a particular signal number. In response to detecting this,
coshell 104 sends a message to ksh 106. The message
sent by coshell 104 instructs ksh 106 to kill LDSTUB
process 107 with the same signal number by which submit
process 102 killed client process 103.

In the fifth step, the submit process exits with the
same status number returned by the remotely executed
job. Submit process 102 exits with the status number
returned by the execution of job 109 on remote host 101.
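
The effect of the fifth step is that, from the user's point of
view, the submit process behaves like the job itself. The
Python sketch below illustrates only that exit status
pass-through, using a local child process as a stand-in for the
remotely executing job; the function name is hypothetical, and
the socket message from LDSTUB is not modeled.

```python
import subprocess
import sys


def run_job(argv):
    """Run a job to completion and return its exit status number."""
    # subprocess.run waits for the child and collects its status,
    # much as a parent does with wait(2).
    return subprocess.run(argv).returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Exit with whatever status the job returned.
    sys.exit(run_job(sys.argv[1:]))
```

A shell script that invokes such a wrapper can therefore test
its exit status exactly as it would test the job's own.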

As with the migration process described above, it is
important to note that the ksh under which the user's job
had been running, along with its pipe connections to its
coshell, is kept running. This ksh may be used again
later by its coshell provided that the remote host upon
which the ksh is running becomes idle again by satisfying
the three conditions described above for a candidate
remote host. In the case of FIG. 1, ksh 106 is kept running
along with its command pipe 116 and status pipe
120 connections to coshell 104.

The workstations shared through the present invention
can be all of one type (homogeneous workstations).
An example is a network where only Sun Microsystems
workstations can be shared. Alternatively, the present
invention can be used to share a variety of workstation
types (heterogeneous workstations). An example is a
network where both Sun Microsystems workstations
and Silicon Graphics workstations can be shared.

Where the present invention is used with heterogeneous
workstations, the migration of a job, based upon
workstation type, is typically as follows. Where no type
is specified upon submitting a job, the job is limited to
migrating among workstations of the same type as the
workstation from which the job was submitted. The user
can specify, however, upon submitting a job, that the job
be executed only upon workstations of one particular type,
and this type need not be the same as the type of
workstation from which the job was submitted. The user
can also specify, upon submitting a job, that the job can
be executed on any workstation type.
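
The three type-selection cases above can be expressed as a simple filter over candidate hosts. This is a minimal sketch under stated assumptions: the function `eligible_hosts`, the `ANY_TYPE` marker, and the type strings are hypothetical and do not reflect the actual submission syntax.

```python
from typing import Iterable, List, Optional, Tuple

ANY_TYPE = "any"  # hypothetical marker meaning "any workstation type"


def eligible_hosts(
    hosts: Iterable[Tuple[str, str]],
    submit_type: str,
    requested_type: Optional[str] = None,
) -> List[str]:
    """Return the names of hosts a job may run on or migrate to.

    hosts          -- (host name, workstation type) pairs
    submit_type    -- type of the workstation the job was submitted from
    requested_type -- None (no type specified), a specific type, or ANY_TYPE
    """
    if requested_type == ANY_TYPE:
        # The job may execute on any workstation type.
        return [name for name, _ in hosts]
    # With no type specified, stay on the submitting workstation's type;
    # otherwise use the one type the user asked for.
    wanted = submit_type if requested_type is None else requested_type
    return [name for name, host_type in hosts if host_type == wanted]
```

For example, with hosts of types `sun` and `sgi`, a job submitted from a `sun` workstation with no type given is restricted to the `sun` hosts, while the same job submitted with `sgi` is restricted to the `sgi` hosts.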

Persons skilled in the art will appreciate that the
present invention may be practiced by other than the
described embodiments, which are presented for purposes
of illustration and not of limitation, and the present
invention is limited only by the claims which follow.

Claims

1. A system for sharing computer resources among a
plurality of computers connected together by a network,
comprising:

    a local computer that accepts jobs for remote
    execution, the host computer being one of the
    plurality of computers;

    a remote computer that is capable of receiving
    jobs for remote execution, the remote computer
    being one of the plurality of computers and
    executing a current status process to collect
    current information that includes activity
    information about primary users of the remote
    computer; and

    a scheduling computer that chooses a location
    for remote execution of the accepted jobs
    based upon at least the primary user current
    information regarding the remote computer, the
    scheduling computer being one of the plurality
    of computers.

2. The system of claim 1, wherein the scheduling
computer and the local computer are the same computer.

3. The system of claim 1, wherein the scheduling
computer and the remote computer are the same computer.

4. The system of claim 1, wherein the local computer,
the remote computer and the scheduling computer
are three different computers of the plurality of
computers.

5. The system of claim 1, wherein the scheduling
computer capability of choosing a location for remote
execution is executed as a decentralized scheduling
process among the plurality of computers.

6. The system of claim 1, wherein the local computer
executes a submission process in response to a job
being submitted for remote execution by a user of
the local computer.

7. The system of claim 1, wherein the remote computer
also executes:

    a monitor process that monitors the activity of
    primary users of the remote computer.

8. The system of claim 1, wherein the activity
information includes time elapsed since a last activity
on an input device by a primary user of the remote
computer.

9. The system of claim 8, wherein the input device is
directly connected to the remote computer.

10. The system of claim 8, wherein the input device is
connected to the remote computer through a remote
login process.

11. The system of claim 1, wherein the scheduling
computer chooses a remote computer as the computer
to remotely execute a job only if the time elapsed
since a last activity on the input device by a primary
user of the remote computer is greater than a
predetermined threshold.

12. The system of claim 1, further comprising a file
system shared by at least the local and remote
computers.

13. The system of claim 12, wherein a primary user of
the remote computer is determined by attribute
information stored in a central location in the file
system.

14. The system of claim 13, wherein the scheduling
computer chooses a location based on current
information kept in the file system and the current
status process on each remote computer posts the
current information it collects to the central location
in the file system.

15. A system for sharing computer resources among a
plurality of computers connected together by a network,
comprising:

    a local computer that accepts jobs that a user
    has submitted to the local computer for execution
    on a remote computer; and

    a remote computer, selected by a scheduling
    process, that executes the job submitted by the
    user, and executes a monitor process that
    causes the execution of the job to be suspended
    if the monitor process detects activity on the
    remote computer by a primary user of the remote
    computer.

16. The system of claim 15, wherein the remote
computer suspends execution of the job if activity by a
primary user of the remote computer is detected on
an input device of the remote computer.

17. The system of claim 15, wherein the local computer
executes a submission process such that if the job
is suspended by the remote computer, the job is
resubmitted by the submission process to the scheduling
process for execution on a second remote computer.

18. The system of claim 15, wherein the monitor
process is spawned from a parent process.

19. The system of claim 15, wherein the monitor
process is a fine-grain polling process.

20. The system of claim 15, wherein the job is linked
with a checkpointing library.

21. The system of claim 15, further comprising a file
system shared by at least the local and remote
computers.

22. The system of claim 21, wherein a primary user of
the remote computer is determined by attribute
information stored in a central location in the file
system.

23. The system of claim 15, wherein the scheduling
process is a decentralized scheduling process.

24. The system of claim 23, wherein the decentralized
scheduling process is executed on the local computer
that submits a job for remote execution.

25. The system of claim 15, wherein the scheduling
process is executed by a single processor for the
entire network.

26. A method of sharing computer resources among a
plurality of computers connected together by a network
comprising the steps of:

    accepting jobs for remote execution on a first
    computer;

    executing a scheduling process to schedule the
    remote execution of the accepted jobs based
    on current information about each of the plurality
    of computers that may be remotely accessed, the
    current information including activity information
    about primary users of each of the remotely
    accessible computers; and

    running a current status process on at least one
    second computer of the plurality of computers,
    the second computer being one of the remotely
    accessible computers, the current status process
    operating to collect current information
    about the second computer.

27. The method of claim 26, further comprising the step
of:

    executing a submission process on the first
    computer in response to a job being submitted by a
    user of the first computer.

28. The method of claim 26, wherein the step of running
a current status process collects activity information
that includes time elapsed since a last activity on
an input device by a primary user of the second
computer.

29. The method of claim 26, wherein the step of
executing a scheduling process chooses a remote
computer for remote execution of the submitted job only
if the time elapsed since a last activity on an input
device by a primary user of the remote computer is
greater than a predetermined threshold, the remote
computer being one of the second computers.

30. The method of claim 29, further comprising the
steps of:

    executing one of the accepted jobs on the remote
    computer chosen by the scheduling process; and

    monitoring the remote computer for activity by
    a primary user of the remote computer.

31. The method of claim 30, further comprising the step
of:

    suspending the remote execution of the job
    when activity by a primary user is sensed during the
    step of monitoring.

32. The method of claim 31, further comprising the step
of:

    resubmitting the suspended job to the scheduling
    process for execution on a different remote
    computer when activity by a primary user is sensed
    during the step of monitoring.

33. The method of claim 30, further comprising the step
of:

    linking the job to a checkpointing library prior
    to executing the job.

34. The method of claim 26, wherein the step of
executing a scheduling process is executed as a
decentralized scheduling process.

35. The method of claim 26, wherein the step of
executing a scheduling process is executed as a
centralized scheduling process on a single processor.